This project relies on data from New York City's Taxi and Limousine Commission (TLC). NYC publishes TLC data for all trips taken by Yellow Taxis, Green Taxis, For-Hire Vehicles, and High Volume For-Hire Vehicles. We rely on the Yellow Taxi data, as this is the transportation method most people use and are familiar with. NYC makes full trip data available starting in 2009, organized by month. Each month contains data on roughly 7 million trips. Given the size of this data, we chose to work only with data from 2019. A significant amount of data is available for each trip. The dataset contains information on pickup and dropoff times, pickup and dropoff locations, rate code, payment type, fare amount, credit card tips, and total amount.
For this project, Jason and I received permission from Professor Brambor to work in a group of two. The two of us had been planning since before the semester began to work on a project using traffic data from New York City. We also plan to expand the scope of this project during the summer by applying machine learning models to this dataset. Given these factors and our shared interest in the topic, working in a group of two is most effective for us.
Our website is split into two main portions. For the first part of our visualization, we used several ggplot graphs to highlight patterns in trip duration. These are time series built from aggregated data, with the configuration tailored to each plot. For the second part, we chose the variables best suited to being shown on maps. These are leaflet maps built from aggregated data, with custom features added.
Please keep in mind that all the graphs are interactive. Instead of focusing on a specific variable, our group thought it would be more useful to give users control over the visualizations so they can draw out the insights relevant to their individual needs, whether from the perspective of a consumer or a taxi driver. Our team would also like to point out that all the preprocessing and data scrubbing, as well as the exploratory data analysis, are included in our process book and excluded from this section. This website only includes our final data visualization outputs, in accordance with the instructions laid out on the class website.
When aggregating data, our team used the median for all variables apart from tip amounts. This was done because the data contained a fair number of outliers that could skew the mean during aggregation. As such, we determined that the median was the best way to capture the typical value of each variable of interest.
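To illustrate why outliers matter here, the following minimal sketch compares the mean and the median of a handful of trip durations. The values are made up for illustration and are not taken from the TLC data; the project's own aggregation was not necessarily written this way.

```python
from statistics import mean, median

# Hypothetical trip durations in minutes (not real TLC data).
# The last value stands in for an outlier, e.g. a meter left running.
durations = [8, 10, 11, 12, 13, 240]

avg = mean(durations)    # pulled far upward by the single outlier
med = median(durations)  # stays close to the typical trip

print(f"mean: {avg} minutes, median: {med} minutes")
```

A single extreme trip moves the mean well away from the bulk of the values, while the median barely changes.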
Using ggplot, our group first decided to focus on the duration of the trip. We think there is an interesting story to be told from this variable. From a consumer perspective, it is useful to know at what hour of the day, on what day of the week, or in what month of the year trips are shortest. Likewise, it is useful for a cab driver to know when trips are longest or shortest. Even though the same approach could have been applied to trip distance and fare amount, we thought these would be better shown on a map than on graphs. This is shown in our next section.
From the graph above, the peak period runs from 9 a.m. to 6 p.m., with the median duration hovering around 12 minutes; the duration gradually declines after 6 p.m. The lowest points fall around 5 a.m. or 6 a.m. This is reasonable considering that most people work from 9 a.m. to 6 p.m., and people frequently use cabs throughout the working day in the bustling city of New York.
Our group further dissected the duration over a day by comparing its trend for each day of the week. Weekdays follow a similar pattern: a sharp increase in trip duration from 7 a.m. to 12 p.m., followed by a gradual decline. This resembles the aggregated graph above, but the effect is more pronounced and the rush hour seems to start earlier. Weekends differ clearly: Saturday and Sunday both show a gradual increase starting at 7 a.m. and lasting until 7 p.m. (Saturday) or 3 p.m. (Sunday). This suggests that people start their day more slowly on these days.
Note: you can isolate a single day by double-clicking it in the legend box of the interactive graph. This applies to any interactive graph with the legend embedded within it.
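The hourly aggregation behind graphs like these can be sketched as follows. This is a minimal illustration under stated assumptions: the timestamps are invented, the timestamp format is assumed, and `median_duration_by_hour` is a hypothetical helper name rather than the project's actual code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical (pickup, dropoff) timestamp pairs, not real TLC records.
trips = [
    ("2019-03-04 05:10:00", "2019-03-04 05:17:00"),
    ("2019-03-04 05:40:00", "2019-03-04 05:49:00"),
    ("2019-03-04 09:05:00", "2019-03-04 09:16:00"),
    ("2019-03-04 09:30:00", "2019-03-04 09:43:00"),
    ("2019-03-04 09:50:00", "2019-03-04 10:05:00"),
]

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def median_duration_by_hour(trips):
    """Group trips by pickup hour and return the median duration in minutes."""
    by_hour = defaultdict(list)
    for pickup, dropoff in trips:
        start = datetime.strptime(pickup, FMT)
        end = datetime.strptime(dropoff, FMT)
        by_hour[start.hour].append((end - start).total_seconds() / 60)
    return {hour: median(minutes) for hour, minutes in sorted(by_hour.items())}


result = median_duration_by_hour(trips)
print(result)
```

Grouping on `start.weekday()` or `start.month` instead of `start.hour` would give the day-of-week and monthly views in the same way.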
Next, we shifted gears to study how the duration varies by day of the week. The graph almost resembles a bell curve, with Monday and Sunday having the two lowest durations (10.5 and 10 minutes), while Thursday marks the peak at around 12 minutes. This matches the day-to-day comparison in the previous graph, where Thursday had the highest duration and Sunday the lowest.
Our group thought it would also be interesting to analyze the duration by day of the week for each month and compare month to month. In general, June and October consistently share the highest peak across the days of the week. This may be attributable to major holidays (e.g., summer break) or the start of school (e.g., the fall semester for universities), when people tend to travel more than in other months of the year. From day to day, Thursday seems to have the highest trip duration of any day of the week, and Sunday the lowest. This pattern is consistent with what we saw in the graph above.
The following maps present different aspects of Yellow Taxi activity within New York City. These maps are divided into Taxi Zones defined by the city. Map visualizations have the advantage of clearly highlighting features by geography. The first set of maps examines the tipping behavior of passengers, organized by pickup zone. The next set looks more broadly at NYC traffic patterns, analyzing which zones were more congested. The final map examines taxi activity as a whole.
This map displays the average credit card tip amount for each NYC Yellow Taxi pickup zone. The data was filtered to include only trips paid by credit card, as the NYC TLC dataset does not record cash tips. Interestingly, Manhattan pickup zones have relatively low average tips, while the other boroughs generally appear to have higher average tips.
Though mapping the average tip amount effectively shows which zones tip more, a more accurate way to analyze tips is by tip percent, since most riders tip a percentage of the total trip cost. The map below displays the average tip percent by taxi zone. This map tells a similar story to the previous visualization. Fun fact: the average tip percentage in our dataset for 2019 is 18%.
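The tip percent calculation can be sketched as below. This is only an illustration, with several assumptions: the records are invented, the field names and the credit card code `1` are assumed to match the TLC data dictionary, and the tip is measured against the pre-tip total (total amount minus tip), which may differ from the base the project actually used.

```python
from statistics import mean

CREDIT_CARD = 1  # assumed payment_type code for credit card trips

# Hypothetical trip records, not real TLC data.
trips = [
    {"payment_type": 1, "tip_amount": 2.0, "total_amount": 12.0},
    {"payment_type": 1, "tip_amount": 3.0, "total_amount": 18.0},
    {"payment_type": 2, "tip_amount": 0.0, "total_amount": 9.0},  # cash: tip not recorded
]


def tip_percent(trip):
    """Tip as a percentage of the pre-tip total (an assumed definition)."""
    pre_tip_total = trip["total_amount"] - trip["tip_amount"]
    return 100 * trip["tip_amount"] / pre_tip_total


# Keep only credit card trips with a positive pre-tip total.
card_trips = [
    t for t in trips
    if t["payment_type"] == CREDIT_CARD and t["total_amount"] > t["tip_amount"]
]

avg_tip_percent = mean(tip_percent(t) for t in card_trips)
print(f"average tip percent: {avg_tip_percent:.1f}%")
```

Filtering to credit card trips first matters because cash trips carry a recorded tip of zero and would otherwise drag the average down.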